Abstract

We study coresets for various types of range counting queries on uncertain data. In our model each
uncertain point has a probability density describing its location, sometimes defined as $k$ distinct locations.
Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that
range counting queries can be answered by just examining this subset. We study three distinct types
of queries. RE queries return the expected number of points in a query range. RC queries return the
number of points in the range with probability at least a threshold. RQ queries return the probability
that fewer than some threshold fraction of the points are in the range. In both RC and RQ coresets the
threshold is provided as part of the query. For each type of query we provide coreset constructions
with approximation-size tradeoffs. We show that random sampling can be used to construct each type
of coreset, and we also provide significantly improved bounds using discrepancy-based approaches on
axis-aligned range queries.
1 Introduction

A powerful notion in computational geometry is the coreset [3, 2, 10, 45]. Given a large data set $P$ and a
family of queries $\mathcal{A}$, an $\eta$-coreset is a subset $S \subset P$ such that for all $r \in \mathcal{A}$ we have
$\|r(P) - r(S)\| \le \eta$ (note the notion of distance $\|\cdot\|$ between query results is problem specific and is
intentionally left ambiguous for now). Initially used for smallest enclosing ball queries [10] and perhaps
most famous in geometry for extent queries as $\eta$-kernels [3, 2], the coreset is now employed in many other
problems such as clustering [6] and density estimation [45]. Techniques for constructing coresets are
becoming more relevant in the era of big data; they summarize a large data set $P$ with a proxy set $S$ of
potentially much smaller size that can guarantee error for certain classes of queries. They also shed light
onto the limits of how much information can possibly be represented in a small set of data.

In this paper we focus on a specific type of coreset called an $\eta$-sample f^Sj |2T1 [SI that can be thought of
as preserving density queries and that has deep ties to the basis of learning theory Q. Given a set of objects
$X$ (often $X \subset \mathbb{R}^d$ is a point set) and a family of subsets $\mathcal{A}$ of $X$, the pair $(X, \mathcal{A})$ is called a range
space. Often $\mathcal{A}$ are specified by containment in geometric shapes, for instance as all subsets of $X$ defined by
inclusion in any ball, any half space, or any axis-aligned rectangle. Now an $\eta$-sample of $(X, \mathcal{A})$ is a single
subset $S \subset X$ such that

$$\max_{r \in \mathcal{A}} \left| \frac{|X \cap r|}{|X|} - \frac{|S \cap r|}{|S|} \right| \le \eta.$$

For any query range $r \in \mathcal{A}$, subset $S$ approximates the relative density of $X$ in $r$ with error at most $\eta$.

Uncertain points. Another emerging notion in data analysis is modeling uncertainty in points. There are
several formulations of these problems where each point $p \in P$ has an independent probability distribution
$\mu_p$ describing its location and such a point is said to have locational uncertainty. Imprecise points (also
called deterministic uncertainty) model where a data point $p \in P$ could be anywhere within a fixed continuous
range and were originally used for analyzing precision errors. The worst case properties of a point set
$P$ under the imprecise model have been well-studied [19, 20, 7, 22, 33, 36, 37, 44, 30]. Indecisive points (or
attribute uncertainty in database literature POl ) model each $p_i \in P$ as being able to take one of $k$ distinct
locations $\{p_{i,1}, p_{i,2}, \ldots, p_{i,k}\}$ with possibly different probabilities, modeling when multiple readings of the
same object have been made ll26l l4tl [TTl [T^ [Tl l4l .

We also note another common model of existential uncertainty (similar to tuple uncertainty in database
literature [40] but a bit less general) where the location or value of each $p \in P$ is fixed, but the point may
not exist with some probability, modeling false readings [28, 27, 40, 17].

We will focus mainly on the indecisive model of locational uncertainty since it comes up frequently in
real-world applications B3l IH (when multiple readings of the same object are made, and typically $k$ is
small) and can be used to approximately represent more general continuous representations [25, 38].

1.1 Problem Statement 

Combining these two notions leads to the question: can we create a coreset (specifically for $\eta$-samples)
of uncertain input data? A few more definitions are required to rigorously state this question. In fact, we
develop three distinct notions of how to define the coreset error in uncertain points. One corresponds to
range counting queries, another to querying the mean, and the third to querying the median (actually it
approximates the rank for all quantiles).

For an uncertain point set $P = \{p_1, p_2, \ldots, p_n\}$ with each $p_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,k}\} \subset \mathbb{R}^d$ we say that
$Q \Subset P$ is a transversal if $Q \in p_1 \times p_2 \times \cdots \times p_n$. That is, $Q = (q_1, q_2, \ldots, q_n)$ is an instantiation of the
uncertain data $P$ and can be treated as a "certain" point set, where each $q_i$ corresponds to the location of $p_i$.
$\Pr_{Q \Subset P}[\zeta(Q)]$ (resp. $E_{Q \Subset P}[\zeta(Q)]$) represents the probability (resp. expected value) of an event $\zeta(Q)$ where
$Q$ is instantiated from $P$ according to the probability distribution on the uncertainty in $P$.

As stated, our goal is to construct a subset of uncertain points $T \subset P$ (including the distribution of each
point $p$'s location, $\mu_p$) that preserves specific properties over a family of subsets $(P, \mathcal{A})$. For completeness,
the first variation we list cannot be accomplished purely with a coreset as it requires $\Omega(n)$ space.

• Range Reporting (RR) Queries support queries of a range $r \in \mathcal{A}$ and a threshold $\tau$, and return all
$p_i \in P$ such that $\Pr_{Q \Subset P}[q_i \in r] \ge \tau$. Note that the fate of each $p_i \in P$ depends on no other $p_j \in P$
where $i \ne j$, so they can be considered independently. Building indexes for this model has been
studied IIIIIIIIBIII^ and effectively solved in M^ Q.

• Range Expectation (RE) Queries consider a range $r \in \mathcal{A}$ and report the expected number of uncertain
points in $r$, $E_{Q \Subset P}[|r \cap Q|]$. The linearity of expectation allows summing the individual probabilities
that each point $p \in P$ is in $r$. Single queries in this model have also been studied [23, 11, 24].

• Range Counting (RC) Queries support queries of a range $r \in \mathcal{A}$ and a threshold $\tau$, but only return
the number of $p_i \in P$ which satisfy $\Pr_{Q \Subset P}[q_i \in r] \ge \tau$. The effect of each $p_i \in P$ on the query
is separate from that of any other $p_j \in P$ where $i \ne j$. A random sampling heuristic [46] has been
suggested without proof of accuracy.

• Range Quantile (RQ) Queries take a query range $r \in \mathcal{A}$, and report the full cumulative density
function on the number of points in the range $\Pr_{Q \Subset P}[|r \cap Q|]$. Thus for a query range $r$, this returned
structure can produce for any value $\tau \in [0, 1]$ the probability that $\tau n$ or fewer points are in $r$. Since
this is no longer an expectation, the linearity of expectation cannot be used to decompose this query
along individual uncertain points.

Across all queries we consider, there are two main ways we can approximate the answers. The first and
most standard way is to allow an $\varepsilon$-error (for $0 < \varepsilon < 1$) in the returned answer for RQ, RE, and RC.
The second way is to allow an $\alpha$-error in the threshold associated with the query itself. As will be shown,
this is not necessary for RR, RE, or RC, but is required to get useful bounds for RQ. Finally, we will also
consider probabilistic error $\delta$, demarcating the probability of failure in a randomized algorithm (such as
random sampling). We strive to achieve these approximation factors with a small size coreset $T \subset P$ as
follows:



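RE: For a given range $r$, let $r(Q) = |Q \cap r| / |Q|$, and let $E_r(P) = E_{Q \Subset P}[r(Q)]$. $T \subset P$ is an $\varepsilon$-RE coreset
of $(P, \mathcal{A})$ if for all queries $r \in \mathcal{A}$ we have $|E_r(P) - E_r(T)| \le \varepsilon$.

RC: For a range $r \in \mathcal{A}$, let $G_{P,r}(\tau) = \frac{1}{|P|} \left| \{ p_i \in P \mid \Pr_{Q \Subset P}[q_i \in r] \ge \tau \} \right|$ be the fraction of points in $P$
that are in $r$ with probability at least some threshold $\tau$. Then $T \subset P$ is an $\varepsilon$-RC coreset of $(P, \mathcal{A})$ if
for all queries $r \in \mathcal{A}$ and all $\tau \in [0, 1]$ we have $|G_{P,r}(\tau) - G_{T,r}(\tau)| \le \varepsilon$.

RQ: For a range $r \in \mathcal{A}$, let $F_{P,r}(\tau) = \Pr_{Q \Subset P}[r(Q) \le \tau] = \Pr_{Q \Subset P}\left[ \frac{|Q \cap r|}{|Q|} \le \tau \right]$ be the probability that
at most a $\tau$ fraction of $P$ is in $r$. Now $T \subset P$ is an $(\varepsilon, \alpha)$-RQ coreset of $(P, \mathcal{A})$ if for all $r \in \mathcal{A}$ and
$\tau \in [0, 1]$ there exists a $\gamma \in [\tau - \alpha, \tau + \alpha]$ such that $|F_{P,r}(\tau) - F_{T,r}(\gamma)| \le \varepsilon$. In such a situation, we
also say that $F_{T,r}$ is an $(\varepsilon, \alpha)$-quantization of $F_{P,r}$.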

A natural question is whether we can construct an $(\varepsilon, 0)$-RQ coreset where there is not a secondary $\alpha$-error
term on $\tau$. We demonstrate that there are no useful non-trivial bounds on the size of such a coreset.

When the $(\varepsilon, \alpha)$-quantization $F_{T,r}$ need not be explicitly represented by a coreset $T$, then Löffler and
Phillips [32, 26] show a different small space representation that can replace it in the above definition
of an $(\varepsilon, \alpha)$-RQ coreset with probability at least $1 - \delta$. First randomly create $m = O((1/\varepsilon^2) \log(1/\delta))$
transversals $Q_1, Q_2, \ldots, Q_m$, and for each transversal $Q_i$ create an $\alpha$-sample $S_i$ of $(Q_i, \mathcal{A})$. Then to satisfy
the requirements of $F_{T,r}(\tau)$, there exists some $\gamma \in [\tau - \alpha, \tau + \alpha]$ such that we can return
$(1/m) |\{ S_i \mid r(S_i) \le \gamma \}|$, and it will be within $\varepsilon$ of $F_{P,r}(\tau)$. However, this is subverting the attempt to
construct and understand a coreset to answer these questions. A coreset $T$ (our goal) can be used as proxy
for $P$ as opposed to querying $m$ distinct point sets. This alternate approach also does not shed light into how
much information can be captured by a small size point set, which is provided by bounds on the size of a coreset.

Simple example. We illustrate a simple example with $k = 2$ and $d = 1$, where $n = 10$ and the $nk = 20$
possible locations of the 10 uncertain points are laid out in order:

$$p_{1,1} < p_{2,1} < p_{3,1} < p_{4,1} < p_{5,1} < p_{6,1} < p_{3,2} < p_{7,1} < p_{8,1} < p_{8,2} < p_{9,1} < p_{10,1} < p_{5,2} < p_{10,2} < p_{2,2}
< p_{9,2} < p_{7,2} < p_{4,2} < p_{6,2} < p_{1,2}.$$

We consider a coreset $T \subset P$ that consists of the uncertain points $T = \{p_1, p_3, p_5, p_7, p_9\}$. Now consider
a specific range $r \in \mathcal{I}_+$, a one-sided interval that contains $p_{5,2}$ and smaller points, but not $p_{10,2}$ and larger
points. We can now see that $F_{T,r}$ is an $(\varepsilon' = 0.1016, \alpha = 0.1)$-quantization of $F_{P,r}$ in Figure 1; this follows
since $F_{P,r}(0.75) = 0.7734$, while $F_{T,r}(x)$ is at most 0.5 for $x \in [0.65, 0.8)$ and is at least 0.875 for
$x \in [0.8, 0.85]$. Also observe that

$$|E_r(P) - E_r(T)| = \left| \frac{13}{20} - \frac{7}{10} \right| = \frac{1}{20} = \varepsilon.$$

When these errors (the $(\varepsilon', \alpha)$-quantization and $\varepsilon$-error) hold for all ranges in some range space, then $T$ is
an $(\varepsilon', \alpha)$-RQ coreset or $\varepsilon$-RE coreset, respectively.
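
The numbers in this example can be checked mechanically. The sketch below is ours, not part of the original construction: it assumes each of the two locations of a point is taken with probability 1/2, uses the rank of a location in the listed order as its coordinate, and takes the range to be everything up to and including $p_{5,2}$ (rank 13). The helper names (`build_points`, `cdf_fraction`, and so on) are hypothetical.

```python
from fractions import Fraction

# Sorted order of the 20 locations, written as (point index i, location index j).
ORDER = [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (3, 2), (7, 1),
         (8, 1), (8, 2), (9, 1), (10, 1), (5, 2), (10, 2), (2, 2), (9, 2),
         (7, 2), (4, 2), (6, 2), (1, 2)]


def build_points(order):
    """Map each point index to its locations, using the rank as the coordinate."""
    points = {}
    for rank, (i, _j) in enumerate(order, start=1):
        points.setdefault(i, []).append(rank)
    return points


def prob_in_range(locs, hi):
    """Probability that a point lies in (-inf, hi]; locations are equally likely."""
    return Fraction(sum(x <= hi for x in locs), len(locs))


def expected_fraction(points, hi):
    """E_r(P): expected fraction of the points that fall in the range."""
    return sum(prob_in_range(locs, hi) for locs in points) / len(points)


def cdf_fraction(points, hi, tau):
    """F_{P,r}(tau): probability that at most a tau fraction of points is in r."""
    dist = [Fraction(1)]  # dist[c] = Pr[exactly c of the points seen so far are in r]
    for locs in points:
        q = prob_in_range(locs, hi)
        nxt = [Fraction(0)] * (len(dist) + 1)
        for c, pr in enumerate(dist):
            nxt[c] += pr * (1 - q)
            nxt[c + 1] += pr * q
        dist = nxt
    n = len(points)
    return sum(pr for c, pr in enumerate(dist) if Fraction(c, n) <= tau)


ALL = build_points(ORDER)
P = [ALL[i] for i in range(1, 11)]
T = [ALL[i] for i in (1, 3, 5, 7, 9)]
HI = 13  # rank of p_{5,2}

if __name__ == "__main__":
    print(float(cdf_fraction(P, HI, Fraction(3, 4))))  # F_{P,r}(0.75)
    print(float(cdf_fraction(T, HI, Fraction(4, 5))))  # F_{T,r}(0.80)
    print(expected_fraction(P, HI), expected_fraction(T, HI))
```

Under these assumptions $F_{P,r}(0.75) = 99/128 \approx 0.7734$ and $F_{T,r}(0.8) = 7/8$, a gap of about 0.1016, and the two expected fractions are 13/20 and 7/10.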

To understand the error associated with an RC coreset, also consider the threshold $\tau = 2/3$ with respect to
the range $r$. Then in range $r$, 2/10 of the uncertain points from $P$ are in $r$ with probability at least $\tau = 2/3$
(points $p_3$ and ps). Also 1/5 of the uncertain points from $T$ are in $r$ with probability at least $\tau = 2/3$ (only
point ps). So there is 0 RC error for this range and threshold.

1.2 Our Results 

We provide the first results for RE-, RC-, and RQ-coresets with guarantees. In particular we show that a
random sample $T$ of size $O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ with probability $1 - \delta$ is an $\varepsilon$-RC coreset for any family
of ranges $\mathcal{A}$ whose associated range space has VC-dimension $\nu$. If we instead enforce that each uncertain
point has $k$ possible locations, then a sample $T$ of size $O((1/\varepsilon^2)(\nu + \log(k/\delta)))$ suffices for an $\varepsilon$-RE coreset.

Figure 1: Example cumulative density functions ($F_{T,r}$, in red with fewer steps, and $F_{P,r}$, in blue with more
steps) on uncertain point set $P$ and a coreset $T$ for a specific range.

Then we leverage discrepancy-based techniques [35, 12] for some specific families of ranges $\mathcal{A}$, to improve
these bounds to $O((1/\varepsilon)\,\mathrm{poly}(k, \log(1/\varepsilon)))$. This is an important improvement since $1/\varepsilon$ can be
quite large (say 100 or more), while $k$, interpreted as the number of readings of a data point, is small
for many applications (say 5). In $\mathbb{R}^1$, for one-sided ranges we construct $\varepsilon$-RE and $\varepsilon$-RC coresets of size
$O((\sqrt{k}/\varepsilon) \log(k/\varepsilon))$. For axis-aligned rectangles in $\mathbb{R}^d$ we construct $\varepsilon$-RE coresets of size
$O((\sqrt{k}/\varepsilon) \cdot \log^{(3d-1)/2}(k/\varepsilon))$ and $\varepsilon$-RC coresets of size $O((k^{3d+1/2}/\varepsilon) \log^{6d-1/2}(k/\varepsilon))$. Finally, we show
that any $\varepsilon$-RE coreset of size $t$ is also an $(\varepsilon, \alpha_{\varepsilon,t})$-RQ coreset with value
$\alpha_{\varepsilon,t} = \varepsilon + \sqrt{(1/2t) \ln(2/\varepsilon)}$.
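
To get a feel for the secondary error term in the RQ guarantee, the value $\alpha_{\varepsilon,t} = \varepsilon + \sqrt{(1/2t)\ln(2/\varepsilon)}$ can be evaluated directly. This is a small illustrative sketch; the function name `alpha` is ours.

```python
import math


def alpha(eps: float, t: int) -> float:
    """Evaluate alpha_{eps,t} = eps + sqrt((1 / (2 t)) * ln(2 / eps))."""
    if not (0.0 < eps < 1.0) or t <= 0:
        raise ValueError("need 0 < eps < 1 and t >= 1")
    return eps + math.sqrt(math.log(2.0 / eps) / (2.0 * t))


if __name__ == "__main__":
    # The second term shrinks like 1/sqrt(t) as the coreset grows.
    for t in (100, 1_000, 10_000):
        print(t, round(alpha(0.1, t), 4))
```

The first term is fixed by $\varepsilon$, so a larger coreset only reduces the square-root term.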

These results leverage new connections between uncertain points and both discrepancy of permutations 
and colored range searching that may be of independent interest. 



2 Discrepancy and Permutations 

The key tool we will use to construct small coresets for uncertain data is discrepancy of range spaces, and
specifically those defined on permutations. Consider a set $X$, a range space $(X, \mathcal{A})$, and a coloring
$\chi : X \to \{-1, +1\}$. Then for some range $A \in \mathcal{A}$, the discrepancy is defined as
$\mathrm{disc}_\chi(X, A) = \left| \sum_{x \in X \cap A} \chi(x) \right|$. We can then extend this to be over all ranges,
$\mathrm{disc}_\chi(X, \mathcal{A}) = \max_{A \in \mathcal{A}} \mathrm{disc}_\chi(X, A)$, and over all colorings,
$\mathrm{disc}(X, \mathcal{A}) = \min_\chi \mathrm{disc}_\chi(X, \mathcal{A})$.

Consider a ground set $(P, \Sigma_k)$ where $P$ is a set of $n$ objects, and $\Sigma_k = \{\sigma_1, \sigma_2, \ldots, \sigma_k\}$ is a set of $k$
permutations over $P$ so each $\sigma_j : P \to [n]$. We can also consider a family of ranges $\mathcal{I}_k$ as a set of intervals
defined on one of the $k$ permutations, so $I_{x,y,j} \in \mathcal{I}_k$ is defined so $P \cap I_{x,y,j} = \{p \in P \mid x < \sigma_j(p) \le y\}$ for
$x < y \in [0, n]$ and $j \in [k]$. The pair $((P, \Sigma_k), \mathcal{I}_k)$ is then a range space, defining a set of subsets of $P$.

A canonical way to obtain $k$ permutations from an uncertain point set $P = \{p_1, p_2, \ldots, p_n\}$ is as follows.
Define the $j$th canonical transversal of $P$ as the set $P_j = \bigcup_{i=1}^{n} p_{i,j}$. When each $p_{i,j} \in \mathbb{R}^1$, the sorted order
of each canonical transversal $P_j$ defines a permutation on $P$ as $\sigma_j(p_i) = |\{p_{i',j} \in P_j \mid p_{i',j} \le p_{i,j}\}|$; that is,
$\sigma_j(p_i)$ describes how many locations (including $p_{i,j}$) in the transversal $P_j$ have value less than or equal to $p_{i,j}$.
In other words, $\sigma_j$ describes the sorted order of the $j$th point among all uncertain points. Then, given an
uncertain point set, let the canonical transversals define the canonical $k$-permutation as $(P, \Sigma_k)$.
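
The canonical permutations described above can be computed directly from the definition. The following is a minimal sketch with names of our choosing; it assumes every uncertain point is given as a list of exactly $k$ real locations, with the $j$th entry playing the role of $p_{i,j}$.

```python
def canonical_permutations(P):
    """Return [sigma_1, ..., sigma_k], where sigma_j[i] is the rank of p_{i,j} in P_j.

    sigma_j(p_i) counts the locations in the j-th canonical transversal P_j
    whose value is at most p_{i,j} (so the smallest location gets rank 1).
    """
    n, k = len(P), len(P[0])
    sigmas = []
    for j in range(k):
        Pj = [p[j] for p in P]  # j-th canonical transversal
        sigmas.append([sum(q <= Pj[i] for q in Pj) for i in range(n)])
    return sigmas


if __name__ == "__main__":
    P = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0]]
    for j, sigma in enumerate(canonical_permutations(P), start=1):
        print(f"sigma_{j} =", sigma)
```

The quadratic counting loop mirrors the definition; sorting each transversal once would give the same ranks in $O(n \log n)$ time per permutation when all values are distinct.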

A geometric view of the permutation range space embeds $P$ as $n$ fixed points in $\mathbb{R}^k$ and considers ranges
which are defined by inclusion in $(k-1)$-dimensional slabs, defined by two parallel half spaces with normals
aligned along one of the coordinate axes. Specifically, the $j$th coordinate of the $i$th point is $\sigma_j(p_i)$, and if
the range is on the $j$th permutation, then the slab is orthogonal to the $j$th coordinate axis.

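Another useful construction from an uncertain point set $P$ is the set $P_{\mathrm{cert}}$ of all locations any point in $P$
might occur. Specifically, for every uncertain point set $P$ we can define the corresponding certain point set

$$P_{\mathrm{cert}} = \bigcup_{i \in [n]} p_i = \bigcup_{j \in [k]} P_j = \bigcup_{i \in [n],\, j \in [k]} p_{i,j}.$$

We can also extend any coloring $\chi$ on $P$ to a coloring in $P_{\mathrm{cert}}$ by letting $\chi_{\mathrm{cert}}(p_{i,j}) = \chi(p_i)$, for $i \in [n]$
and $j \in [k]$. Now we can naturally define the discrepancy induced on $P_{\mathrm{cert}}$ by any coloring $\chi$ of $P$ as

$$\mathrm{disc}_{\chi_{\mathrm{cert}}}(P_{\mathrm{cert}}, \mathcal{A}) = \max_{r \in \mathcal{A}} \left| \sum_{p_{i,j} \in P_{\mathrm{cert}} \cap r} \chi_{\mathrm{cert}}(p_{i,j}) \right|.$$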



From low-discrepancy to $\varepsilon$-samples. There is a well-studied relationship between range spaces that
admit low-discrepancy colorings, and creating $\varepsilon$-samples of those range spaces [13, 35, 12, 8]. The key
relationship states that if $\mathrm{disc}(X, \mathcal{A}) = \gamma \log^{\omega}(n)$, then there exists an $\varepsilon$-sample of $(X, \mathcal{A})$ of size
$O((\gamma/\varepsilon) \cdot \log^{\omega}(\gamma/\varepsilon))$ [38], for values $\gamma, \omega$ independent of $n$ or $\varepsilon$. Construct the coloring, and with equal
probability discard either all points colored $-1$ or all those colored $+1$. This roughly halves the point set size, and
also implies zero over-count in expectation for any fixed range. Repeat this coloring and reduction of points
until the desired size is achieved. This can be done efficiently in a distributed manner through a merge-reduce
framework [13]. The take-away is that a method for a low-discrepancy coloring directly implies
a method to create an $\varepsilon$-sample, where the counting error is in expectation zero for any fixed range. We
describe and extend these results in much more detail in Appendix A.
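
The color-and-halve loop can be sketched for the simplest setting: certain points in $\mathbb{R}^1$ with one-sided intervals. As an assumption made purely for illustration, the coloring here alternates $+1/-1$ along the sorted order, which has discrepancy at most 1 on such intervals; it stands in for the low-discrepancy colorings cited above and is not the coloring used for permutations. All function names are ours.

```python
import random


def alternate_coloring(points):
    """Color points +1/-1 alternately in sorted order (discrepancy <= 1 on prefixes)."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    chi = [0] * len(points)
    for rank, i in enumerate(order):
        chi[i] = 1 if rank % 2 == 0 else -1
    return chi


def halve(points, rng):
    """Keep one color class, chosen with equal probability."""
    chi = alternate_coloring(points)
    keep = rng.choice([1, -1])
    return [p for p, c in zip(points, chi) if c == keep]


def reduce_to(points, target, rng):
    """Repeat the color-and-halve step until at most `target` points remain."""
    current = list(points)
    while len(current) > target:
        current = halve(current, rng)
    return current


def max_density_error(X, S):
    """Largest gap between the densities of X and S over ranges (-inf, x], x in X."""
    worst = 0.0
    for x in X:
        fx = sum(p <= x for p in X) / len(X)
        fs = sum(p <= x for p in S) / len(S)
        worst = max(worst, abs(fx - fs))
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    X = rng.sample(range(10**6), 1024)
    S = reduce_to(X, 64, rng)
    print(len(S), max_density_error(X, S))
```

With distinct points and a set of even size $m$, one halving step changes the density of any one-sided interval by at most $1/m$, so the errors of successive rounds add up geometrically and the final round dominates.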

3 RE Coresets 

First we will analyze $\varepsilon$-RE coresets through the $P_{\mathrm{cert}}$ interpretation of uncertain point set $P$. The canonical
transversals $P_j$ of $P$ will also be useful. In Section 3.2 we will relate these results to a form of discrepancy.



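Lemma 3.1. $T \subset P$ is an $\varepsilon$-RE coreset for $(P, \mathcal{A})$ if and only if $T_{\mathrm{cert}} \subset P_{\mathrm{cert}}$ is an $\varepsilon$-sample for $(P_{\mathrm{cert}}, \mathcal{A})$.

Proof. First note that since $\Pr[p_i = p_{i,j}] = \frac{1}{k}$ for all $i, j$, by linearity of expectations we have that
$E_{Q \Subset P}[|Q \cap r|] = \sum_{i=1}^{n} E[|p_i \cap r|] = \frac{1}{k} |P_{\mathrm{cert}} \cap r|$. Now, direct computation gives us:

$$\left| \frac{|P_{\mathrm{cert}} \cap r|}{|P_{\mathrm{cert}}|} - \frac{|T_{\mathrm{cert}} \cap r|}{|T_{\mathrm{cert}}|} \right|
= \left| \frac{|P_{\mathrm{cert}} \cap r|}{k|P|} - \frac{|T_{\mathrm{cert}} \cap r|}{k|T|} \right|
= |E_r(P) - E_r(T)| \le \varepsilon. \qquad \square$$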
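
The identity behind Lemma 3.1 is easy to check numerically: with each of the $k$ locations of a point equally likely, $E_r(P)$ equals the density of the range in $P_{\mathrm{cert}}$. The sketch below uses exact rational arithmetic; the function names and the predicate `in_range` are our own illustrative choices.

```python
from fractions import Fraction


def expected_fraction(P, in_range):
    """E_r(P) when each of a point's k locations is taken with probability 1/k."""
    total = Fraction(0)
    for locs in P:
        total += Fraction(sum(1 for x in locs if in_range(x)), len(locs))
    return total / len(P)


def cert_density(P, in_range):
    """|P_cert ∩ r| / |P_cert|, where P_cert collects every location of every point."""
    cert = [x for locs in P for x in locs]
    return Fraction(sum(1 for x in cert if in_range(x)), len(cert))


if __name__ == "__main__":
    P = [[1, 5], [2, 4], [3, 6], [0, 7]]
    T = P[::2]  # an arbitrary subset of the uncertain points
    r = lambda x: x <= 4
    print(expected_fraction(P, r), cert_density(P, r))
    print(expected_fraction(T, r), cert_density(T, r))
```

The two functions agree only because every point has the same number $k$ of equally likely locations, which is the setting of the lemma.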
The next implication enables us to determine an $\varepsilon$-RE coreset on $P$ from $\varepsilon$-samples on each $P_j \Subset P$.
Recall $P_j$ is the $j$th canonical transversal of $P$ for $j \in [k]$, and is defined similarly for a subset $T \subset P$ as $T_j$.

Lemma 3.2. Given a range space $(P_{\mathrm{cert}}, \mathcal{A})$, if we have $T \subset P$ such that $T_j$ is an $\varepsilon$-sample for $(P_j, \mathcal{A})$ for
all $j \in [k]$, then $T$ is an $\varepsilon$-RE coreset for $(P, \mathcal{A})$.



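Proof. Consider an arbitrary range $r \in \mathcal{A}$, and compute directly $|E_r(P) - E_r(T)|$. Recalling that
$E_r(P) = \frac{|P_{\mathrm{cert}} \cap r|}{|P_{\mathrm{cert}}|}$ and observing that $|P_{\mathrm{cert}}| = k|P|$, we get that:

$$|E_r(P) - E_r(T)|
= \left| \frac{\sum_{j=1}^{k} |P_j \cap r|}{k|P|} - \frac{\sum_{j=1}^{k} |T_j \cap r|}{k|T|} \right|
\le \frac{1}{k} \sum_{j=1}^{k} \left| \frac{|P_j \cap r|}{|P_j|} - \frac{|T_j \cap r|}{|T_j|} \right|
\le \frac{1}{k}(k\varepsilon) = \varepsilon. \qquad \square$$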
3.1 Random Sampling

We show that simple random sampling gives us an $\varepsilon$-RE coreset of $P$.

Theorem 3.1. For an uncertain point set $P$ and range space $(P_{\mathrm{cert}}, \mathcal{A})$ with VC-dimension $\nu$, a random
sample $T \subset P$ of size $O((1/\varepsilon^2)(\nu + \log(k/\delta)))$ is an $\varepsilon$-RE coreset of $(P, \mathcal{A})$ with probability at least $1 - \delta$.

Proof. A random sample $T_j$ of size $O((1/\varepsilon^2)(\nu + \log(1/\delta')))$ is an $\varepsilon$-sample of any $(P_j, \mathcal{A})$ with probability
at least $1 - \delta'$ [31]. Now assuming $T \subset P$ resulted from a random sample on $P$, it induces the $k$ disjoint
canonical transversals $T_j$ on $T$, such that $T_j \subset P_j$ and $|T_j| = O((1/\varepsilon^2)(\nu + \log(1/\delta')))$ for $j \in [k]$. Each
$T_j$ is an $\varepsilon$-sample of $(P_j, \mathcal{A})$ for any single $j \in [k]$ with probability at least $1 - \delta'$. Following Lemma 3.2 and
using the union bound, we conclude that $T \subset P$ is an $\varepsilon$-RE coreset for uncertain point set $P$ with probability at
least $1 - k\delta'$. Setting $\delta' = \delta/k$ proves the theorem. $\square$
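
A random-sampling construction along the lines of Theorem 3.1 might look as follows. This is only a sketch: the theorem fixes the sample size up to a constant factor, so the constant `c` below is a placeholder assumption, and the function names are ours. Note that whole uncertain points are sampled, each keeping all $k$ of its locations.

```python
import math
import random


def sample_size(eps, nu, k, delta, c=1.0):
    """Size (c / eps^2) * (nu + ln(k / delta)); c stands in for the unspecified constant."""
    return math.ceil((c / eps**2) * (nu + math.log(k / delta)))


def random_re_coreset(P, eps, nu, delta, c=1.0, rng=None):
    """Sample uncertain points uniformly without replacement.

    P is a list of uncertain points, each a list of k locations.
    """
    rng = rng or random.Random()
    k = len(P[0])
    m = min(len(P), sample_size(eps, nu, k, delta, c))
    return rng.sample(P, m)


if __name__ == "__main__":
    rng = random.Random(0)
    P = [[rng.random() for _ in range(5)] for _ in range(10_000)]
    T = random_re_coreset(P, eps=0.1, nu=2, delta=0.05, rng=rng)
    print(len(T))
```

The $\log(k/\delta)$ term reflects the union bound over the $k$ canonical transversals in the proof.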



3.2 RE-Discrepancy and its Properties 

Next we extend the well-studied relationship between geometric discrepancy and $\varepsilon$-samples on certain data
towards $\varepsilon$-RE coresets on uncertain data.

We first require precise and slightly non-standard definitions.

We introduce a new type of discrepancy based on the expected value of uncertain points called
RE-discrepancy. Let $P^+$ and $P^-$ denote the sets of uncertain points from $P$ colored $+1$ or $-1$, respectively,
by $\chi$. Then $\text{RE-disc}_\chi(P, r) = |P| \cdot |E_r(P^+) - E_r(P)|$ for any $r \in \mathcal{A}$. The usual extensions then follow:
$\text{RE-disc}_\chi(P, \mathcal{A}) = \max_{r \in \mathcal{A}} \text{RE-disc}_\chi(P, r)$ and $\text{RE-disc}(P, \mathcal{A}) = \min_\chi \text{RE-disc}_\chi(P, \mathcal{A})$. Note that $(P, \mathcal{A})$ is
technically not a range space, since $\mathcal{A}$ defines subsets of $P_{\mathrm{cert}}$ in this case, not of $P$.



Lemma 3.3. Consider a coloring $\chi : P \to \{-1, +1\}$ such that $\text{RE-disc}_\chi(P, \mathcal{A}) = \gamma \log^{\omega}(n)$ and $|P^+| = n/2$.
Then the set $P^+$ is an $\varepsilon$-RE coreset of $(P, \mathcal{A})$ with $\varepsilon = \frac{\gamma}{n} \log^{\omega}(n)$.

Furthermore, if a subset $T \subset P$ has size $n/2$ and is a $(\frac{\gamma}{n} \log^{\omega}(n))$-RE coreset, then it defines a coloring
$\chi$ (where $\chi(p_i) = +1$ for $p_i \in T$) that has $\text{RE-disc}_\chi(P, \mathcal{A}) = \gamma \log^{\omega}(n)$.

Proof. We prove the second statement; the first follows symmetrically. We refer to the subset $T$ as $P^+$. Let
$r = \arg\max_{r' \in \mathcal{A}} |E_{r'}(P) - E_{r'}(P^+)|$. This implies
$\frac{\gamma}{n} \log^{\omega} n \ge |E_r(P) - E_r(P^+)| = \frac{1}{n} \text{RE-disc}_\chi(P, r)$. $\square$



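We can now recast RE-discrepancy to discrepancy on $P_{\mathrm{cert}}$. From Lemma 3.1,

$$\left| \frac{|P_{\mathrm{cert}} \cap r|}{k|P|} - \frac{|T_{\mathrm{cert}} \cap r|}{k|T|} \right| = |E_r(P) - E_r(T)|,$$

and after some basic substitutions we obtain the following.

Lemma 3.4. $\text{RE-disc}_\chi(P, \mathcal{A}) = \frac{1}{k}\, \mathrm{disc}_{\chi_{\mathrm{cert}}}(P_{\mathrm{cert}}, \mathcal{A})$.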
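
The relation in Lemma 3.4 can be checked range by range on a small instance. The sketch below assumes a balanced coloring ($|P^+| = n/2$, as in Lemma 3.3) and equally likely locations; the function names are ours.

```python
from fractions import Fraction


def expected_fraction(P, in_range):
    """E_r(P) with each of a point's k locations equally likely."""
    total = Fraction(0)
    for locs in P:
        total += Fraction(sum(1 for x in locs if in_range(x)), len(locs))
    return total / len(P)


def re_disc(P, chi, in_range):
    """RE-disc_chi(P, r) = |P| * |E_r(P^+) - E_r(P)|."""
    plus = [locs for locs, c in zip(P, chi) if c == +1]
    return len(P) * abs(expected_fraction(plus, in_range) - expected_fraction(P, in_range))


def cert_disc(P, chi, in_range):
    """disc on P_cert: every location inherits the color of its uncertain point."""
    return abs(sum(c for locs, c in zip(P, chi) for x in locs if in_range(x)))


if __name__ == "__main__":
    P = [[1, 5], [2, 4], [3, 6], [0, 7]]
    chi = [+1, -1, +1, -1]  # balanced: |P^+| = n / 2
    k = 2
    for a in range(-1, 9):
        r = lambda x, a=a: x <= a
        print(a, re_disc(P, chi, r), Fraction(cert_disc(P, chi, r), k))
```

For every threshold the two printed values coincide, which is the per-range form of the lemma.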

This does not immediately solve $\varepsilon$-RE coresets by standard discrepancy techniques on $P_{\mathrm{cert}}$ because we
need to find a coloring $\chi$ on $P$. A coloring $\chi_{\mathrm{cert}}$ on $P_{\mathrm{cert}}$ may not be consistent across all $p_{i,j} \in p_i$. The
following lemma allows us to reduce this to a problem of coloring each canonical transversal $P_j$.

Lemma 3.5. $\text{RE-disc}_\chi(P, \mathcal{A}) \le \max_j \mathrm{disc}_{\chi_{\mathrm{cert}}}(P_j, \mathcal{A})$.



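Proof. For any $r \in \mathcal{A}$ and any coloring $\chi$ (and the corresponding $\chi_{\mathrm{cert}}$), we can write $P_{\mathrm{cert}}$ as a union of
disjoint transversals $P_j$ to obtain

$$\mathrm{disc}_{\chi_{\mathrm{cert}}}(P_{\mathrm{cert}}, r)
= \left| \sum_{j=1}^{k} \sum_{p_{i,j} \in P_j \cap r} \chi_{\mathrm{cert}}(p_{i,j}) \right|
\le \sum_{j=1}^{k} \left| \sum_{p_{i,j} \in P_j \cap r} \chi_{\mathrm{cert}}(p_{i,j}) \right|
\le k \max_j \mathrm{disc}_{\chi_{\mathrm{cert}}}(P_j, r).$$

Since this holds for every $r \in \mathcal{A}$, we have (using Lemma 3.4)

$$\text{RE-disc}_\chi(P, \mathcal{A}) = \frac{1}{k}\, \mathrm{disc}_{\chi_{\mathrm{cert}}}(P_{\mathrm{cert}}, \mathcal{A}) \le \max_j \mathrm{disc}_{\chi_{\mathrm{cert}}}(P_j, \mathcal{A}). \qquad \square$$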
3.3 $\varepsilon$-RE Coresets in $\mathbb{R}^1$

Lemma 3.6. Consider uncertain point set $P$ with $P_{\mathrm{cert}} \subset \mathbb{R}^1$ and the range space $(P_{\mathrm{cert}}, \mathcal{I}_+)$ with ranges
defined by one-sided intervals of the form $(-\infty, x]$. Then $\text{RE-disc}(P, \mathcal{I}_+) = O(\sqrt{k} \log n)$.

Proof. Spencer et al. [41] show that $\mathrm{disc}((P, \Sigma_k), \mathcal{I}_k)$ is $O(\sqrt{k} \log n)$. Since we obtain the $\Sigma_k$ from the
canonical transversals $P_1$ through $P_k$, by definition this results in upper bounds on the discrepancy over
all $P_j$ (it bounds the max). Lemma 3.5 then gives us the bound on $\text{RE-disc}(P, \mathcal{I}_+)$. $\square$



As discussed in Appendix A, the low RE-discrepancy coloring can be iterated in a merge-reduce
framework as developed by Chazelle and Matousek [13]. With Theorem A.2 we can prove the following
theorem.

Theorem 3.2. Consider uncertain point set $P$ and range space $(P_{\mathrm{cert}}, \mathcal{I}_+)$ with ranges defined by one-sided
intervals of the form $(-\infty, x]$. Then an $\varepsilon$-RE coreset can be constructed of size $O((\sqrt{k}/\varepsilon) \log(k/\varepsilon))$.

Since expected value is linear, $\text{RE-disc}_\chi(P, (-\infty, x]) - \text{RE-disc}_\chi(P, (-\infty, y)) = \text{RE-disc}_\chi(P, [y, x])$ for
$y < x$ and the above result also holds for the family of two-sided ranges $\mathcal{I}$.



3.4 $\varepsilon$-RE Coresets for Rectangles in $\mathbb{R}^d$

Here let $P$ be a set of $n$ uncertain points where each possible location of a point $p_{i,j} \in \mathbb{R}^d$. We consider a
range space $(P_{\mathrm{cert}}, \mathcal{R}_d)$ defined by $d$-dimensional axis-aligned rectangles.

Each canonical transversal $P_j$ for $j \in [k]$ no longer implies a unique permutation on the points (for $d > 1$).
But, for any rectangle $r \in \mathcal{R}_d$, we can represent any $r \cap P_j$ as the disjoint union of points of $P_j$ contained in
intervals on a predefined set of $(1 + \log n)^{d-1}$ permutations [9]. Spencer et al. [41] showed there exists a
coloring $\chi$ such that

$$\max_j \mathrm{disc}_\chi(P_j, \mathcal{R}_d) = O(D_\ell(n) \log^{d-1} n),$$

where $\ell = (1 + \log n)^{d-1}$ is the number of defined permutations and $D_\ell(n)$ is the discrepancy of $\ell$
permutations over $n$ points and ranges defined as intervals on each permutation. Furthermore, they showed
$D_\ell(n) = O(\sqrt{\ell} \log n)$.



To get the RE-discrepancy bound for $P_{\mathrm{cert}} = \bigcup_{j=1}^{k} P_j$, we first decompose $P_{\mathrm{cert}}$ into the $k$ point sets $P_j$
of size $n$. We then obtain $(1 + \log n)^{d-1}$ permutations over points in each $P_j$, and hence obtain a family $\Sigma_\ell$
of $\ell = k(1 + \log n)^{d-1}$ permutations over all $P_j$. $D_\ell(n) = O(\sqrt{\ell} \log n)$ yields

$$\mathrm{disc}((P, \Sigma_\ell), \mathcal{I}_\ell) = O(\sqrt{k} \log^{\frac{d+1}{2}} n).$$

Now each set $P_j \cap r$ for $r \in \mathcal{R}_d$ can be written as the disjoint union of $O(\log^{d-1} n)$ intervals of $\Sigma_\ell$.
Summing up over each interval, we get that $\mathrm{disc}(P_j, \mathcal{R}_d) = O(\sqrt{k} \log^{\frac{3d-1}{2}} n)$ for each $j$. By Lemma 3.5 this
bounds the RE-discrepancy as well. Finally, we can again apply the merge-reduce framework of Chazelle
and Matousek [13] (via Theorem A.2) to achieve an $\varepsilon$-RE coreset.

Theorem 3.3. Consider uncertain point set $P$ and range space $(P_{\mathrm{cert}}, \mathcal{R}_d)$ (for $d > 1$) with ranges defined
by axis-aligned rectangles in $\mathbb{R}^d$. Then an $\varepsilon$-RE coreset can be constructed of size $O((\sqrt{k}/\varepsilon) \log^{\frac{3d-1}{2}}(k/\varepsilon))$.

4 RC Coresets 

Recall that an $\varepsilon$-RC coreset $T$ of a set $P$ of $n$ uncertain points satisfies that for all queries $r \in \mathcal{A}$ and all
thresholds $\tau \in [0, 1]$ we have $|G_{P,r}(\tau) - G_{T,r}(\tau)| \le \varepsilon$, where $G_{P,r}(\tau)$ represents the fraction of points
from $P$ that are in range $r$ with probability at least $\tau$.

In this setting, given a range $r \in \mathcal{A}$ and a threshold $\tau \in [0, 1]$ we can let the pair $(r, \tau) \in \mathcal{A} \times [0, 1]$ define
a range $R_{r,\tau}$ such that each $p_i \in P$ is either in or not in $R_{r,\tau}$. Let $(P, \mathcal{A} \times [0, 1])$ denote this range space.
If $(P_{\mathrm{cert}}, \mathcal{A})$ has VC-dimension $\nu$, then $(P, \mathcal{A} \times [0, 1])$ has VC-dimension $O(\nu + 1)$; see Corollary 5.23 in
On . This implies that random sampling works to construct $\varepsilon$-RC coresets.



Theorem 4.1. For uncertain point set $P$ and range space $(P_{\mathrm{cert}}, \mathcal{A})$ with VC-dimension $\nu$, a random sample
$T \subset P$ of size $O((1/\varepsilon^2)(\nu + \log(1/\delta)))$ is an $\varepsilon$-RC coreset of $(P, \mathcal{A})$ with probability at least $1 - \delta$.

Yang et al. propose a similar result [46], without proof.

4.1 RC Coresets in $\mathbb{R}^1$

Constructing $\varepsilon$-RC coresets when the family of ranges $\mathcal{I}_+$ represents one-sided, one-dimensional intervals
is much easier than other cases. It relies heavily on the ordered structure of the canonical permutations, and
thus discrepancy results do not need to decompose and then re-compose the ranges.

Lemma 4.1. A point $p_i \in P$ is in range $r \in \mathcal{I}_+$ with probability at least $\tau = t/k$ if and only if $p_{i,t} \in r \cap P_t$.

Proof. By the canonical permutations, since for all $i \in [n]$ we require $p_{i,j} < p_{i,j+1}$, if $p_{i,t} \in r$ then it
follows that $p_{i,j} \in r$ for $j \le t$. Similarly if $p_{i,t} \notin r$, then all $p_{i,j} \notin r$ for $j \ge t$. $\square$
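
Lemma 4.1 says that a one-sided RC query at threshold $t/k$ only needs the $t$th smallest location of each point. The sketch below compares that shortcut with a direct count; it assumes equally likely locations (so "probability at least $t/k$" means "at least $t$ locations in the range"), and the function names are ours.

```python
import random


def rc_fraction_one_sided(P, a, t):
    """G_{P,r}(t/k) for r = (-inf, a], looking only at the t-th smallest location."""
    return sum(1 for locs in P if sorted(locs)[t - 1] <= a) / len(P)


def rc_fraction_brute(P, a, t):
    """Same quantity, counting for each point how many locations fall in r."""
    return sum(1 for locs in P if sum(x <= a for x in locs) >= t) / len(P)


if __name__ == "__main__":
    rng = random.Random(0)
    k = 4
    P = [rng.sample(range(1000), k) for _ in range(200)]
    for t in range(1, k + 1):
        print(t, rc_fraction_one_sided(P, 500, t), rc_fraction_brute(P, 500, t))
```

Sorting inside the function keeps the sketch independent of the input order; in the paper's convention the locations are already indexed so that $p_{i,j} < p_{i,j+1}$.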

Thus when each canonical permutation is represented up to an error $\varepsilon$ by a coreset $T$, then each threshold $\tau$
is represented within $\varepsilon$. Hence, as with $\varepsilon$-RE coresets, we invoke the low-discrepancy coloring of Bohus [9]
and Spencer et al. [41], and then iterate them (invoking Theorem A.1) to achieve a small size $\varepsilon$-RC coreset.

Theorem 4.2. For uncertain point set $P$ and range space $(P_{\mathrm{cert}}, \mathcal{I}_+)$ with ranges defined by one-sided
intervals of the form $(-\infty, a]$, an $\varepsilon$-RC coreset of $(P, \mathcal{I}_+)$ can be constructed of size $O((\sqrt{k}/\varepsilon) \log(k/\varepsilon))$.


Extending Lemma 4.1 from one-sided intervals of the form $(-\infty, a] \in \mathcal{I}_+$ to intervals of the form $[a, b] \in \mathcal{I}$
turns out to be non-trivial. It is not true that $G_{P,[a,b]}(\tau) = G_{P,(-\infty,b]}(\tau) - G_{P,(-\infty,a]}(\tau)$, hence the two
queries cannot simply be subtracted. Also, while the set of points corresponding to the query $G_{P,(-\infty,a]}(t/k)$
are a contiguous interval in the $t$th permutation we construct in Lemma 4.1, the same need not be true of
points corresponding to $G_{P,[a,b]}(t/k)$. This is a similar difficulty in spirit as noted by Kaplan et al. [29] in
the problem of counting the number of points of distinct colors in a box where one cannot take a naive
decomposition and add up the numbers returned by each subproblem.

Figure 2: Uncertain point $p_i$ queried by range $[a, b]$. Lifting shown to $p_i^*$ along dimensions 1 and 2 (left) and along
dimensions 2 and 3 (right).

We now give a construction to solve this two-sided problem for uncertain points in $\mathbb{R}^1$ inspired by that
of Kaplan et al. [29], but we require specifying a fixed value of $t \in [k]$. Given an uncertain point $p_i \in P$
assume w.l.o.g. that $p_{i,j} < p_{i,j+1}$. Also pretend there is a point $p_{i,k+1} = \eta$ where $\eta$ is larger than any $b \in \mathbb{R}^1$
from a query range $[a, b]$ (essentially $\eta = \infty$). Given a range $[a, b]$, we consider the right-most set of $t$
locations of $p_i$ (here $\{p_{i,j-t}, \ldots, p_{i,j}\}$) that are in the range. This satisfies (i) $p_{i,j-t} \ge a$, (ii) $p_{i,j} \le b$, and
(iii) to ensure that it is the right-most such set, $p_{i,j+1} > b$.

To satisfy these three constraints we re-pose the problem in $\mathbb{R}^3$ to designate each contiguous set of $t$
possible locations of $p_i$ as a single point. So for $t < j \le k$, we map $p_{i,j}$ to $p^*_{i,j} = (p_{i,j-t}, p_{i,j}, p_{i,j+1})$.
Correspondingly, a range $r = [a, b]$ is mapped to a range $r^* = [a, \infty) \times (-\infty, b] \times (b, \infty)$; see Figure 2. Let
$p^*_i$ denote the set of all $p^*_{i,j}$, and let $P^*$ represent $\bigcup_i p^*_i$.



Lemma 4.2. $p_i$ is in interval $r = [a, b]$ with threshold at least $t/k$ if and only if $|p^*_i \cap r^*| \ge 1$. Furthermore,
no two points $p_{i,j}, p_{i,j'} \in p_i$ can map to points $p^*_{i,j}, p^*_{i,j'}$ such that both are in a range $r^*$.

Proof. Since $p_{i,j} < p_{i,j+1}$, if $p_{i,j-t} \ge a$ it implies all $p_{i,\ell} \ge a$ for $\ell \ge j - t$, and similarly, if $p_{i,j} \le b$
then all $p_{i,\ell} \le b$ for all $\ell \le j$. Hence if $p^*_{i,j}$ satisfies the first two dimensional constraints of the range $r^*$, it
implies $t$ points $p_{i,j-t}, \ldots, p_{i,j}$ are in the range $[a, b]$. Satisfying the constraint of $r^*$ in the third coordinate
indicates that $p_{i,j+1} \notin [a, b]$. There can only be one point $p_{i,j}$ which satisfies the constraint of the last two
coordinates that $p_{i,j} \le b < p_{i,j+1}$. And for any range which contains at least $t$ possible locations, there
must be at least one such set (and only one) of $t$ consecutive points which has this satisfying $p_{i,j}$. $\square$

Corollary 4.1. Any uncertain point set $P \subset \mathbb{R}^1$ of size $n$ and range $r = [a, b]$ has $G_{P,r}(t/k) = |P^* \cap r^*| / n$.
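
The lifting to $\mathbb{R}^3$ can be sketched and compared against a direct count. One indexing choice here is our assumption rather than the text's: so that each lifted point stands for a window of exactly $t$ consecutive locations, the first coordinate is taken to be $p_{i,j-t+1}$ and $j$ ranges over $t, \ldots, k$. Locations are assumed distinct and equally likely, and all function names are ours.

```python
import math
import random


def lift(locs, t):
    """Lift one uncertain point to R^3, one lifted point per window of t locations.

    With s the sorted locations (1-indexed) and s_{k+1} = +inf as a sentinel,
    window j (t <= j <= k) becomes (s_{j-t+1}, s_j, s_{j+1}).
    """
    s = sorted(locs) + [math.inf]
    k = len(locs)
    return [(s[j - t], s[j - 1], s[j]) for j in range(t, k + 1)]


def in_lifted_range(pt, a, b):
    """Membership in r* = [a, inf) x (-inf, b] x (b, inf)."""
    x, y, z = pt
    return x >= a and y <= b and z > b


def rc_fraction_lifted(P, a, b, t):
    """|P* ∩ r*| / n: each uncertain point contributes at most one lifted hit."""
    hits = sum(1 for locs in P for pt in lift(locs, t) if in_lifted_range(pt, a, b))
    return hits / len(P)


def rc_fraction_brute(P, a, b, t):
    """Fraction of points with at least t locations inside [a, b]."""
    return sum(1 for locs in P if sum(a <= x <= b for x in locs) >= t) / len(P)


if __name__ == "__main__":
    rng = random.Random(1)
    P = [rng.sample(range(1000), 5) for _ in range(50)]
    for t in range(1, 6):
        print(t, rc_fraction_lifted(P, 200, 700, t), rc_fraction_brute(P, 200, 700, t))
```

The third coordinate is what prevents double counting: only the right-most window inside $[a, b]$ has its successor strictly beyond $b$.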

This presents an alternative view of each uncertain point in $\mathbb{R}^1$ with $k$ possible locations as an uncertain
point in $\mathbb{R}^3$ with $k - t$ possible locations (since for now we only consider a threshold $\tau = t/k$). Where $\mathcal{I}$
represents the family of ranges defined by two-sided intervals, let $\mathcal{I}^*$ be the corresponding family of ranges
in $\mathbb{R}^3$ of the form $[a, \infty) \times (-\infty, b] \times (b, \infty)$ corresponding to an interval $[a, b] \in \mathcal{I}$. Under the assumption
(valid under the lifting defined above) that each uncertain point can have at most one location fall in each
range, we can now decompose the ranges and count the number of points that fall in each sub-range and
add them together. Using the techniques (described in detail in Section 3.4) of Bohus [9] and Spencer et
al. [41] we can consider $\ell = (k - t)(1 + \lceil \log n \rceil)^2$ permutations of $P^*_{\mathrm{cert}}$ such that each range $r^* \in \mathcal{I}^*$ can
be written as the points in a disjoint union of intervals from these permutations. To extend low discrepancy
to each of the $k$ distinct values of threshold $t$, there are $k$ such liftings and $h = k \cdot \ell = O(k^2 \log^2 n)$ such
permutations we need to consider. We can construct a coloring $\chi : P \to \{-1, +1\}$ such that intervals on
each permutation have discrepancy $O(\sqrt{h} \log n) = O(k \log^2 n)$. Recall that for any fixed threshold $t$ we only
need to consider the corresponding $\ell$ permutations, hence the total discrepancy for any such range is at most
the sum of discrepancy from all corresponding $\ell = O(k \log^2 n)$ permutations, or $O(k^2 \log^4 n)$. Finally, this
low-discrepancy coloring can be iterated (via Theorem A.1) to achieve the following theorem.

Theorem 4.3. Consider an uncertain point set $P$ along with ranges $\mathcal{I}$ of two-sided intervals. We can
construct an $\varepsilon$-RC coreset $T$ for $(P, \mathcal{I})$ of size $O((k^2/\varepsilon) \log^4(k/\varepsilon))$.

4.2 RC Coresets for Rectangles in $\mathbb{R}^d$

The approach for $\mathcal{I}$ can be further extended to $\mathcal{R}_d$, axis-aligned rectangles in $\mathbb{R}^d$. Again the key idea is to
define a proxy point set $\bar{P}$ such that $|\bar{r} \cap \bar{P}|$ equals the number of uncertain points in $r$ with at least threshold
$t$. This requires a suitable lifting map and decomposition of space to prevent over or under counting; we
employ techniques from Kaplan et al. [29].

First we transform queries on axis-aligned rectangles in $\mathbb{R}^d$ to the semi-bounded case in $\mathbb{R}^{2d}$. Denoting the
$x_\ell$-coordinate of a point $q$ as $x_\ell(q)$, we double all the coordinates of each point $q = (x_1(q), \ldots, x_\ell(q), \ldots, x_d(q))$
to obtain the point

$$\bar{q} = (-x_1(q), x_1(q), \ldots, -x_\ell(q), x_\ell(q), \ldots, -x_d(q), x_d(q))$$

in $\mathbb{R}^{2d}$. Now answering the range counting query $\prod_{i=1}^{d} [a_i, b_i]$ is equivalent to solving the query

$$\prod_{i=1}^{d} \left[ (-\infty, -a_i] \times (-\infty, b_i] \right]$$

on the lifted point set.
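
The coordinate-doubling reduction is short enough to check directly. In the sketch below (function names ours), a rectangle query in $\mathbb{R}^d$ becomes a negative-orthant query in $\mathbb{R}^{2d}$ whose apex is $(-a_1, b_1, \ldots, -a_d, b_d)$.

```python
import random


def lift_point(q):
    """Map q = (x_1, ..., x_d) to (-x_1, x_1, ..., -x_d, x_d) in R^{2d}."""
    out = []
    for x in q:
        out.extend((-x, x))
    return tuple(out)


def lift_query(lows, highs):
    """Apex of the negative orthant that replaces the rectangle prod_i [a_i, b_i]."""
    apex = []
    for a, b in zip(lows, highs):
        apex.extend((-a, b))
    return tuple(apex)


def in_rectangle(q, lows, highs):
    return all(a <= x <= b for x, a, b in zip(q, lows, highs))


def in_negative_orthant(p, apex):
    """Membership in prod_i (-inf, apex_i]."""
    return all(x <= c for x, c in zip(p, apex))


if __name__ == "__main__":
    rng = random.Random(0)
    lows, highs = (0.2, 0.1, 0.4), (0.7, 0.9, 0.8)
    apex = lift_query(lows, highs)
    pts = [tuple(rng.random() for _ in range(3)) for _ in range(5)]
    for q in pts:
        print(in_rectangle(q, lows, highs), in_negative_orthant(lift_point(q), apex))
```

The equivalence rests on $-x \le -a$ being the same constraint as $x \ge a$, so each two-sided constraint splits into two one-sided ones.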

Based on this reduction we can focus on queries of negative orthants of the form $\prod_{i=1}^{d} (-\infty, a_i]$ and represent each orthant by its apex $a = (a_1, \ldots, a_d) \in \mathbb{R}^d$ as $Q_a^-$. Similarly, we can define $Q_a^+$ as positive orthants of the form $\prod_{i=1}^{d} [a_i, \infty) \subset \mathbb{R}^d$. For any point set $A \subset \mathbb{R}^d$ define $U(A) = \bigcup_{a \in A} Q_a^+$.

A tight orthant has a location of $p_i \in P$ incident to every bounding facet. Let $C_{i,t}$ be the set of all apexes representing tight negative orthants that contain exactly $t$ locations of $p_i$; see Figure 3(a). An important observation is that query orthant $Q_a^-$ contains $p_i$ with threshold at least $t$ if and only if it contains at least one point from $C_{i,t}$.
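To make the observation concrete, here is a brute-force sketch that enumerates $C_{i,t}$ for a single uncertain point and tests the equivalence on random query apexes. It is an illustration under assumptions, not the paper's construction: the helper names are hypothetical, the enumeration tries all $k^d$ coordinate combinations, and the locations are assumed to be in general position (no two share a coordinate).

```python
import itertools
import random

def count_in_orthant(locations, apex):
    """Number of locations inside the negative orthant with this apex."""
    return sum(all(x <= c for x, c in zip(p, apex)) for p in locations)

def tight_apexes(locations, t):
    """Apexes of tight negative orthants containing exactly t locations."""
    d = len(locations[0])
    coords = [sorted({p[j] for p in locations}) for j in range(d)]
    apexes = []
    for apex in itertools.product(*coords):
        inside = [p for p in locations
                  if all(x <= c for x, c in zip(p, apex))]
        if len(inside) != t:
            continue
        # Tight: every bounding facet touches some contained location.
        if all(any(p[j] == apex[j] for p in inside) for j in range(d)):
            apexes.append(apex)
    return apexes

def check_observation(k=8, t=3, d=2, trials=500, seed=1):
    """Q_a^- holds >= t locations iff it contains some apex in C_{i,t}."""
    rng = random.Random(seed)
    locations = [tuple(rng.random() for _ in range(d)) for _ in range(k)]
    C = tight_apexes(locations, t)
    for _ in range(trials):
        a = tuple(rng.random() for _ in range(d))
        has_t = count_in_orthant(locations, a) >= t
        contains_apex = any(all(cj <= aj for cj, aj in zip(c, a)) for c in C)
        if has_t != contains_apex:
            return False
    return True
```

Each candidate apex takes its coordinates from the locations of $p_i$, which is the same counting fact used to bound the number of boxes in the decomposition that follows.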

Let $Q^+_{i,t} = \bigcup_{c \in C_{i,t}} Q_c^+$ be the locus of all negative orthant query apexes that contain at least $t$ locations of $p_i$; see Figure 3(b). Notice that $Q^+_{i,t} = U(C_{i,t})$.



Lemma 4.3. For any point set $p_i \subset \mathbb{R}^d$ of $k$ points and some threshold $1 \le t \le k$, we can decompose $U(C_{i,t})$ into $f(k) = O(k^d)$ pairwise disjoint boxes, $B(C_{i,t})$.

Proof. Let $M(A)$ be the set of maximal empty negative orthants for a point set $A$, such that any $m \in M(A)$ is also bounded in the positive direction along the 1st coordinate axis. Kaplan et al. [29] show (within Lemma 3.1) that $|M(A)| = |B(A)|$ and provide a specific construction of the boxes $B$. Thus we only need to bound $|M(C_{i,t})|$ to complete the proof; see $M(C_{i,t})$ in Figure 3(c). We note that each coordinate of each $c \in C_{i,t}$ must be the same as that of some $p_{i,j} \in p_i$. Thus for each coordinate, among all $c \in C_{i,t}$ there are at most $k$ values. And each maximal empty tight orthant $m \in M(C_{i,t})$ is uniquely defined by the $d$ coordinates along the axis direction each facet is orthogonal to. Thus $|M(C_{i,t})| \le k^d$, completing the proof. □



Note that as we are working in a lifted space $\mathbb{R}^{2d}$, this corresponds to $U(C_{i,t})$ being decomposed into $f(k) = O(k^{2d})$ pairwise disjoint boxes, in which $d$ is the dimensionality of our original point set.

Figure 3: Illustration of uncertain point $p_i \in \mathbb{R}^2$ with $k = 8$ and $t = 3$. (a) All tight negative orthants containing exactly $t = 3$ locations of $p_i$; their apexes are $C_{i,t} = \{c_1, c_2\}$. (b) $U(C_{i,t})$ is shaded, along with query $Q_a^-$. (c) $M(C_{i,t})$, the maximal negative orthants of $C_{i,t}$ that are also bounded in the $x$-direction.

Lemma 4.4. For negative orthant queries $Q_a^-$ with apex $a$ on uncertain point set $P$, a point $p_i \in P$ is in $Q_a^-$ with probability at least $t/k$ if $a$ is in some box in $B(C_{i,t})$, and $a$ will lie in at most one box from $B(C_{i,t})$.

Proof. The query orthant $Q_a^-$ contains point $p_i$ with threshold at least $t$ if and only if $Q_a^-$ contains at least one point from $C_{i,t}$, and this happens only when $a \in U(C_{i,t})$. Since the union of constructed boxes in $B(C_{i,t})$ is equivalent to $U(C_{i,t})$ and they are disjoint, the result follows. □

Corollary 4.2. The number of uncertain points from $P$ in query range $Q_a^-$ with probability at least $t/k$ is exactly the number of boxes in $\bigcup_{i=1}^{n} B(C_{i,t})$ that contain $a$.

Thus for a set of boxes representing P, we need to perform count stabbing queries with apex a and show 
a low-discrepancy coloring of boxes. 

We do a second lifting by transforming each point $a \in \mathbb{R}^d$ to a semi-bounded box $\bar{a} = \prod_{i=1}^{d} ((-\infty, a_i] \times [a_i, \infty))$ and each box $b \subset \mathbb{R}^d$ of the form $\prod_{i=1}^{d} [x_i, y_i]$ to a point $\bar{b} = (x_1, y_1, \ldots, x_i, y_i, \ldots, x_d, y_d)$ in $\mathbb{R}^{2d}$. It is easy to verify that $a \in b$ if and only if $\bar{b} \in \bar{a}$.

Since this is our second doubling of dimension, we are now dealing with points in $\mathbb{R}^{4d}$.
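The duality behind this second lifting (point-in-box becomes lifted-point-in-lifted-box) is easy to confirm numerically. The sketch below is illustrative only; `lift_apex` and `lift_box` are hypothetical names, and boxes are represented as lists of `(lo, hi)` pairs, which is an assumption of this example.

```python
import math
import random

def point_in_box(point, box):
    """box is a list of (lo, hi) bounds, one pair per coordinate."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, box))

def lift_apex(a):
    """Map a point a to the box prod_i ((-inf, a_i] x [a_i, inf))."""
    box = []
    for a_i in a:
        box.append((-math.inf, a_i))
        box.append((a_i, math.inf))
    return box

def lift_box(box):
    """Map the box prod_i [x_i, y_i] to the point (x_1, y_1, ..., x_d, y_d)."""
    return tuple(v for bounds in box for v in bounds)

def check_duality(d=2, trials=1000, seed=2):
    """a lies in b exactly when the lifted point of b lies in the lifted box of a."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.random() for _ in range(d)]
        box = []
        for _ in range(d):
            lo = rng.random()
            box.append((lo, lo + rng.random()))
        if point_in_box(a, box) != point_in_box(lift_box(box), lift_apex(a)):
            return False
    return True
```

After this lifting, a stabbing query (how many boxes contain $a$) becomes an ordinary range counting query over points, which is the form the discrepancy argument needs.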



Lifting $P$ to $\hat{P}$ in $\mathbb{R}^{4d}$ now presents an alternative view of each uncertain point $p_i \in P$ as an uncertain point $\hat{p}_i$ in $\mathbb{R}^{4d}$ with $g_k = O(k^{2d})$ possible locations, with the query boxes represented as $\hat{\mathcal{R}}$ in $\mathbb{R}^{4d}$.



We now proceed similarly to the proof of Theorem 4.3. For a fixed threshold $t$, obtain $\ell = g_k \cdot (1 + \lceil \log n \rceil)^{4d-1}$ disjoint permutations of $\hat{P}$ such that each range $\hat{r} \in \hat{\mathcal{R}}$ can be written as the points in a disjoint union of intervals from these permutations. For the $k$ distinct values of $t$, there are $k$ such liftings and $h = O(k \cdot g_k \cdot \log^{4d-1} n)$ such permutations we need to consider, and we can construct a coloring $\chi : P \to \{-1, +1\}$ so that intervals on each permutation have discrepancy

$$O(\sqrt{h} \log n) = O\left(k^{d + \frac{1}{2}} \log^{2d + \frac{1}{2}} n\right).$$



Hence for any such range and specific threshold $t$, the total discrepancy is the sum of discrepancy from all corresponding $\ell = O(g_k \cdot \log^{4d-1} n)$ permutations, or $O\left(k^{3d + \frac{1}{2}} \log^{6d - \frac{1}{2}} n\right)$. By applying the iterated low-discrepancy coloring (Theorem A.1), we achieve the following result.

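Theorem 4.4. Consider an uncertain point set $P$ and range space $(P_{\mathrm{cert}}, \mathcal{R}_d)$ with ranges defined by axis-aligned rectangles in $\mathbb{R}^d$. Then an $\varepsilon$-RC coreset can be constructed of size $O\left((k^{3d + \frac{1}{2}}/\varepsilon) \log^{6d - \frac{1}{2}}(k/\varepsilon)\right)$.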



5 RQ Coresets 

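In this section, given an uncertain point set $P$ and its $\varepsilon$-RE coreset $T$, we want to determine values $\varepsilon'$ and $\alpha$ so $T$ is an $(\varepsilon', \alpha)$-RQ coreset. That is, for any $r \in \mathcal{A}$ and threshold $\tau \in [0, 1]$ there exists a $\gamma \in [\tau - \alpha, \tau + \alpha]$ such that

$$\left| \Pr_{Q \Subset P}\left[ \frac{|Q \cap r|}{|Q|} \le \tau \right] - \Pr_{S \Subset T}\left[ \frac{|S \cap r|}{|S|} \le \gamma \right] \right| \le \varepsilon'.$$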

At a high level, our tack will be to realize that both $|Q \cap r|$ and $|S \cap r|$ behave like Binomial random variables. By $T$ being an $\varepsilon$-RE coreset of $P$, then after normalizing, its mean is at most $\varepsilon$-far from that of $P$. Furthermore, Binomial random variables tend to concentrate around their mean, and more so for those with more trials. This allows us to say $|S \cap r|/|S|$ is either $\alpha$-close to the expected value of $|Q \cap r|/|Q|$ or is $\varepsilon'$-close to 0 or 1. Since $|Q \cap r|/|Q|$ has the same behavior, but with more concentration, we can bound their distance by the $\alpha$ and $\varepsilon'$ bounds noted before. We now work out the details.

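Theorem 5.1. If $T$ is an $\varepsilon$-RE coreset of $P$ for $\varepsilon \in (0, 1/2)$, then $T$ is an $(\varepsilon', \alpha)$-RQ coreset for $P$ for $\varepsilon', \alpha \in (0, 1/2)$ satisfying $\alpha \ge \varepsilon + \sqrt{(1/(2|T|)) \ln(2/\varepsilon')}$.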

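Proof. We start by examining a Chernoff-Hoeffding bound on a set of independent random variables $X_i$ so that each $X_i \in [a_i, b_i]$ with $\Delta_i = b_i - a_i$. Then for some parameter $\beta \in (0, \sum_i \Delta_i / 2)$

$$\Pr\left[ \left| \sum_i X_i - \mathsf{E}\left[ \sum_i X_i \right] \right| \ge \beta \right] \le 2 \exp\left( \frac{-2\beta^2}{\sum_i \Delta_i^2} \right).$$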






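Consider any $r \in \mathcal{A}$. We now identify each random variable $X_i = 1(q_i \in r)$ (that is, 1 if $q_i \in r$ and 0 otherwise) where $q_i$ is the random instantiation of some $p_i \in T$. So $X_i \in \{0, 1\}$ and $\Delta_i = 1$. Thus by equating $|S \cap r| = \sum_i X_i$

$$\Pr_{S \Subset T}\left[ \big| |S \cap r| - \mathsf{E}[|S \cap r|] \big| \ge \beta |S| \right] \le 2 \exp\left( \frac{-2\beta^2 |S|^2}{\sum_i \Delta_i^2} \right) = 2 \exp(-2\beta^2 |S|) \le \varepsilon'.$$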
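The concentration step can be illustrated with a small simulation. This is a hedged sketch rather than part of the proof: it assumes one-dimensional uncertain points whose $k$ locations are equally likely, an interval range $r = [\mathit{lo}, \mathit{hi}]$, and hypothetical function names. It estimates how often the instantiated fraction $|S \cap r|/|S|$ deviates from its mean by at least $\beta$, for comparison against $2\exp(-2\beta^2|S|)$.

```python
import math
import random

def hoeffding_bound(beta, size):
    """Right-hand side 2 * exp(-2 * beta^2 * |S|) of the bound above."""
    return 2.0 * math.exp(-2.0 * beta * beta * size)

def empirical_tail(T, r, beta, trials, rng):
    """Fraction of instantiations S of T whose in-range fraction is
    at least beta away from its expectation.

    T: list of uncertain points, each a list of equally likely locations.
    r: (lo, hi) interval range.
    """
    lo, hi = r
    # Exact expectation: average over points of Pr[location falls in r].
    expected = sum(sum(lo <= x <= hi for x in p) / len(p) for p in T) / len(T)
    exceed = 0
    for _ in range(trials):
        S = [rng.choice(p) for p in T]  # one random instantiation of T
        frac = sum(lo <= x <= hi for x in S) / len(S)
        if abs(frac - expected) >= beta:
            exceed += 1
    return exceed / trials
```

For example, with 200 uncertain points and $\beta = 0.1$ the bound evaluates to $2e^{-4}$, and the empirical tail frequency from `empirical_tail` is expected to fall below it, since Hoeffding's inequality is typically loose for indicator variables.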



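Thus by solving for $\beta$ (and equating $|S| = |T|$)

$$\Pr_{S \Subset T}\left[ \left| \frac{|S \cap r|}{|S|} - \mathsf{E}\left[ \frac{|S \cap r|}{|S|} \right] \right| \ge \sqrt{\frac{1}{2|T|} \ln\frac{2}{\varepsilon'}} \right] \le \varepsilon'.$$

Now by $T$ being an $\varepsilon$-RE coreset of $P$ then

$$\left| \mathsf{E}_{S \Subset T}\left[ \frac{|S \cap r|}{|S|} \right] - \mathsf{E}_{Q \Subset P}\left[ \frac{|Q \cap r|}{|Q|} \right] \right| \le \varepsilon.$$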



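Combining these two we have

$$\Pr_{S \Subset T}\left[ \left| \frac{|S \cap r|}{|S|} - \mathsf{E}_{Q \Subset P}\left[ \frac{|Q \cap r|}{|Q|} \right] \right| \ge \alpha \right] \le \varepsilon'$$

for $\alpha = \varepsilon + \sqrt{\frac{1}{2|T|} \ln\frac{2}{\varepsilon'}}$.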
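To get a feel for the magnitudes involved, the expression for $\alpha$ can be evaluated directly and also inverted for $|T|$. The helper names below are hypothetical, and the inversion is plain algebra on the stated formula, not a result from the paper.

```python
import math

def rq_alpha(eps, eps_prime, coreset_size):
    """alpha = eps + sqrt( ln(2/eps') / (2|T|) )."""
    return eps + math.sqrt(math.log(2.0 / eps_prime) / (2.0 * coreset_size))

def min_size_for_alpha(eps, eps_prime, alpha):
    """Smallest |T| for which rq_alpha(eps, eps_prime, |T|) <= alpha.

    Requires alpha > eps, since the additive eps term cannot be removed
    by enlarging T alone.
    """
    if alpha <= eps:
        raise ValueError("alpha must exceed eps")
    return math.ceil(math.log(2.0 / eps_prime) / (2.0 * (alpha - eps) ** 2))
```

With $\varepsilon = \varepsilon' = 0.05$ and $|T| = 1000$, `rq_alpha` gives roughly 0.093, so the horizontal slack is dominated by the two terms in comparable measure at this size.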

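Combining these statements, for any $x \le M - \alpha \le M - \alpha'$ we have $\varepsilon' > F_{T,r}(x) \ge 0$ and $\varepsilon' > F_{P,r}(x) \ge 0$ (and symmetrically for $x \ge M + \alpha \ge M + \alpha'$). It follows that $F_{T,r}$ is an $(\varepsilon', \alpha)$-quantization of $F_{P,r}$.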

Since this holds for any $r \in \mathcal{A}$, by $T$ being an $\varepsilon$-RE coreset of $P$, it follows that $T$ is also an $(\varepsilon', \alpha)$-RQ coreset of $P$. □

We can now combine this result with specific results for $\varepsilon$-RE coresets to get size bounds for $(\varepsilon, \alpha)$-RQ coresets. To achieve the bounds below we set $\varepsilon = \varepsilon'$.

Corollary 5.1. For uncertain point set $P$ with range space $(P_{\mathrm{cert}}, \mathcal{A})$, there exists an $(\varepsilon, \varepsilon + \sqrt{(1/(2|T|)) \ln(2/\varepsilon)})$-RQ coreset of $(P, \mathcal{A})$ of size $|T| =$

• $O((1/\varepsilon^2)(\nu + \log(k/\delta)))$ when $\mathcal{A}$ has VC-dimension $\nu$, with probability $1 - \delta$ (Theorem 3.1),

• $O((\sqrt{k}/\varepsilon) \log(k/\varepsilon))$ when $\mathcal{A} = \mathcal{I}$ (Theorem 3.2), and

• $O\left((\sqrt{k}/\varepsilon) \log^{\frac{3d-1}{2}}(k/\varepsilon)\right)$ when $\mathcal{A} = \mathcal{R}_d$ (Theorem 3.3).



Finally we discuss why the $\alpha$ term in the $(\varepsilon', \alpha)$-RQ coreset $T$ is needed. Recall from Section 5 that approximating the value of $\mathsf{E}_{Q \Subset P}\left[\frac{|Q \cap r|}{|Q|}\right]$ with $\mathsf{E}_{S \Subset T}\left[\frac{|S \cap r|}{|S|}\right]$ for all $r$ corresponds to a low-discrepancy sample of $P_{\mathrm{cert}}$. Discrepancy error immediately implies we will have at least the $\varepsilon$ horizontal shift between the two distributions and their means, unless we could obtain a zero-discrepancy sample of $P_{\mathrm{cert}}$. Note this $\varepsilon$-horizontal error corresponds to the $\alpha$ term in an $(\varepsilon', \alpha)$-RQ coreset. When $P$ is very large, then due to the central limit theorem, $F_{P,r}$ will grow very sharply around $\mathsf{E}_{Q \Subset P}\left[\frac{|Q \cap r|}{|Q|}\right]$. In the worst case $F_{T,r}$ may be $\Omega(1)$ vertically away from $F_{P,r}$ on either side of $\mathsf{E}_{S \Subset T}\left[\frac{|S \cap r|}{|S|}\right]$, so no reasonable amount of $\varepsilon'$ vertical tolerance will make up for this gap.

On the other hand, the $\varepsilon'$ vertical component is necessary since for very small probability events (that is, for a fixed range $r$ and small threshold $\tau$) on $P$, we may need a much smaller value of $\tau$ (smaller by $\Omega(1)$) to get the same probability on $T$, requiring a very large horizontal shift. But since it is a very small probability event, only a small vertical $\varepsilon'$ shift is required.

The main result of this section then is showing that there exist pairs (e', a) which are both small. 

6 Conclusion and Open Questions 

This paper defines and provides the first results for coresets on uncertain data. These can be essential tools 
for monitoring a subset of a large noisy data set, as a way to approximately monitor the full uncertainty. 

There are many future directions on this topic, in addition to tightening the provided bounds, especially for other range spaces. Can we remove the dependence on $k$ without random sampling? Can coresets be constructed over uncertain data for other queries such as minimum enclosing ball, clustering, and extents?

A Low Discrepancy to $\varepsilon$-Coreset

Mainly in the 90s Chazelle and Matousek [14, 13, 34, 35, 12] led the development of a method to convert a low-discrepancy coloring into a coreset that allows for approximate range queries. Here we summarize and generalize these results.

We start by restating a result of Phillips [38, 39] which generalizes these results; here we state it a bit more specifically for our setting.

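Theorem A.1 (Phillips [38, 39]). Consider a point set $P$ of size $n$ and a family of subsets $\mathcal{A}$. Assume an $O(n^\beta)$ time algorithm to construct a coloring $\chi : P \to \{-1, +1\}$ so $\mathrm{disc}_\chi(P, \mathcal{A}) = O(\gamma \log^\omega n)$ where $\beta$, $\gamma$, and $\omega$ are constant algorithm parameters dependent on $\mathcal{A}$, but not $P$ (or $n$). There exists an algorithm to construct an $\varepsilon$-sample of $(P, \mathcal{A})$ of size $g(\varepsilon, \mathcal{A}) = O((\gamma/\varepsilon) \log^\omega(\gamma/\varepsilon))$ in time $O(n \cdot g(\varepsilon, \mathcal{A})^{\beta - 1})$.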

Note that we ignored non-exponential dependence on $\omega$ and $\beta$ since in our setting they are data- and problem-independent constants. But we are more careful with $\gamma$ terms since they depend on $k$, the number of locations of each uncertain point.

We restate the algorithm and analysis here for completeness, using $g = g(\varepsilon, \mathcal{A})$ for shorthand. Divide $P$ into $n/g$ parts $\{P_1, P_2, \ldots, P_{n/g}\}$ of size $k = 4(\beta + 2)g$. Assume this divides evenly and $n/g$ is a power of two; otherwise pad $P$ and adjust $g$ by a constant. The algorithm runs in two stages. In stage 1, repeat the following until a single set remains. For $\beta + 2$ steps, pair up all remaining sets, and for each pair (e.g. $P_i$ and $P_j$) construct a low-discrepancy coloring $\chi$ on $P_i \cup P_j$ and discard all points colored $-1$ (or $+1$, chosen at random). In the $(\beta + 3)$rd step pair up all sets, but do not construct a coloring and halve. That is, every epoch ($\beta + 3$ steps) the size of the remaining sets doubles; otherwise they remain the same size. When a single set remains, stage 2 begins; it performs the color-halve part of the above procedure until $\mathrm{disc}(P, \mathcal{A}) \le \varepsilon n$ as desired.
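The pair-color-halve loop can be sketched in a deliberately simplified setting. The code below is an illustration under several assumptions and is not the algorithm as analyzed: it handles only one-dimensional points with interval ranges, uses the alternating coloring of the sorted order (which has discrepancy at most 1 on intervals) in place of a general low-discrepancy coloring, uses parts of size exactly $g$, and omits both the skipped halve on every $(\beta + 3)$rd step and stage 2. All function names are hypothetical.

```python
import random
from bisect import bisect_left, bisect_right

def halve(points, rng):
    """Color sorted points alternately +1/-1 and keep one color class,
    chosen at random. Any interval then has discrepancy at most 1."""
    order = sorted(points)
    keep = rng.choice((0, 1))
    return order[keep::2]

def merge_reduce(points, g, rng):
    """Split into parts of size g, then repeatedly pair parts and halve
    each union until one part of size g remains. Assumes len(points) is
    g times a power of two, and that g is even."""
    parts = [points[i:i + g] for i in range(0, len(points), g)]
    while len(parts) > 1:
        parts = [halve(parts[i] + parts[i + 1], rng)
                 for i in range(0, len(parts), 2)]
    return parts[0]

def interval_error(P, T):
    """Max over intervals with endpoints in P of the difference between
    the fraction of P and the fraction of T inside the interval."""
    P_sorted, T_sorted = sorted(P), sorted(T)
    lo = [bisect_left(T_sorted, x) for x in P_sorted]
    hi = [bisect_right(T_sorted, x) for x in P_sorted]
    worst = 0.0
    for i in range(len(P_sorted)):
        for j in range(i, len(P_sorted)):
            frac_P = (j - i + 1) / len(P_sorted)
            frac_T = (hi[j] - lo[i]) / len(T_sorted)
            worst = max(worst, abs(frac_P - frac_T))
    return worst
```

In this simplified form each round halves unions of size $2g$ with discrepancy at most 1, so each round adds error at most $1/(2g)$ and $\log_2(n/g)$ rounds add at most $\log_2(n/g)/(2g)$ in total. That extra logarithmic factor is exactly what the skipped halve in the full algorithm is designed to remove.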

We begin analyzing the error on a single coloring. 

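Lemma A.1. The set $P^+ = \{p \in P \mid \chi(p) = +1\}$ is a $(\mathrm{disc}_\chi(P, \mathcal{A})/n)$-sample of $(P, \mathcal{A})$.

Proof.

$$\max_{R \in \mathcal{A}} \left| \frac{|P \cap R|}{|P|} - \frac{|P^+ \cap R|}{|P^+|} \right| = \max_{R \in \mathcal{A}} \left| \frac{|P \cap R| - 2|P^+ \cap R|}{n} \right| \le \frac{\mathrm{disc}_\chi(P, \mathcal{A})}{n}.$$

□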



We also note two simple facts [12, 35]:

(S1) If $Q_1$ is an $\varepsilon$-sample of $P_1$ and $Q_2$ is an $\varepsilon$-sample of $P_2$, then $Q_1 \cup Q_2$ is an $\varepsilon$-sample of $P_1 \cup P_2$.

(S2) If $Q$ is an $\varepsilon_1$-sample of $P$ and $S$ is an $\varepsilon_2$-sample of $Q$, then $S$ is an $(\varepsilon_1 + \varepsilon_2)$-sample of $P$.



Note that (S1) (along with Lemma A.1) implies that arbitrarily decomposing $P$ into $n/g$ sets and constructing colorings of each achieves the same error bound as doing so on just one. And (S2) implies that chaining together rounds adds the error in each round. It follows that if we ignore the $(\beta + 3)$rd step in each epoch, then there is 1 set remaining after $\log(n/g)$ steps. The error caused by each step is $\mathrm{disc}(g, \mathcal{A})/g$, so the total error is $\log(n/g)(\gamma \log^\omega g)/g = \varepsilon$. Solving for $g$ yields $g = O\left(\frac{\gamma}{\varepsilon} \log\left(\frac{n\varepsilon}{\gamma}\right) \log^\omega\left(\frac{\gamma}{\varepsilon}\right)\right)$.

Thus to achieve the result stated in the theorem, the $(\beta + 3)$rd step skip of a reduce needs to remove the $\log(n\varepsilon/\gamma)$ term from the error. This works! After $\beta + 3$ steps, the size of each set is $2g$ and the discrepancy error is $\gamma \log^\omega(2g)/2g$. This is just more than half of what it was before, so the total error is now:

$$\sum_{i=0}^{\frac{\log(n/g)}{\beta + 3}} (\beta + 3)\, \gamma \log^\omega(2^i g)/(2^i g) = \Theta\left(\beta(\gamma \log^\omega g)/g\right) = \varepsilon.$$

Solving for $g$ yields $g = O\left(\frac{\gamma}{\varepsilon} \log^\omega(\gamma/\varepsilon)\right)$ as desired. Stage 2 can be shown not to asymptotically increase the error.

To achieve the runtime we again start with the form of the algorithm without the halve-skip on every $(\beta + 3)$rd step. Then the first step takes $O((n/g) \cdot g^\beta)$ time. And each $i$th step takes $O((n/2^{i-1}) g^{\beta - 1})$ time. Since each subsequent step takes half as much time, the runtime is dominated by the first $O(n g^{\beta - 1})$ time step.

For the full algorithm, the first epoch ($\beta + 3$ steps, including a skipped halve) takes $O(n g^{\beta - 1})$ time, and the $i$th epoch takes $O(n/2^{(\beta + 2)i} \cdot (g 2^i)^{\beta - 1}) = O(n g^{\beta - 1}/2^{3i})$ time. Thus the time is still dominated by the first epoch. Again, stage 2 can be shown not to affect this runtime, so the total runtime bound is achieved as desired, which completes the proof.

Finally, we state a useful corollary about the expected error being 0. This holds specifically when we choose to discard the set $P^+$ or $P^- = \{p \in P \mid \chi(p) = -1\}$ at random on each halving.

Corollary A.1. The expected error for any range $R \in \mathcal{A}$ on the $\varepsilon$-sample $T$ created by Theorem A.1 is

$$\mathsf{E}\left[ \frac{|R \cap P|}{|P|} - \frac{|T \cap R|}{|T|} \right] = 0.$$

Note that there is no absolute value taken inside $\mathsf{E}[\cdot]$, so technically this measures the expected undercount.

RE-discrepancy. We are also interested in achieving these same results for RE-discrepancy. To this end, the algorithms are identical. Lemma 3.3 replaces Lemma A.1, and (S1) and (S2) still hold. Nothing else about the analysis depends on properties of disc or RE-disc, so Theorem A.1 can be restated for RE-discrepancy.



Theorem A.2. Consider an uncertain point set $P$ of size $n$ and a family of subsets $\mathcal{A}$ of $P_{\mathrm{cert}}$. Assume an $O(n^\beta)$ time algorithm to construct a coloring $\chi : P \to \{-1, +1\}$ so $\text{RE-disc}_\chi(P, \mathcal{A}) = O(\gamma \log^\omega n)$ where $\beta$, $\gamma$, and $\omega$ are constant algorithm parameters dependent on $\mathcal{A}$, but not $P$ (or $n$). There exists an algorithm to construct an $\varepsilon$-RE coreset of $(P, \mathcal{A})$ of size $g(\varepsilon, \mathcal{A}) = O((\gamma/\varepsilon) \log^\omega(\gamma/\varepsilon))$ in time $O(n \cdot g(\varepsilon, \mathcal{A})^{\beta - 1})$.